Leveraging machine learning to assess market-level food safety and zoonotic disease risks in China

While many have advocated for widespread closure of Chinese wet and wholesale markets due to numerous zoonotic disease outbreaks (e.g., SARS) and food safety risks, this is impractical due to their central role in China’s food system. This first-of-its-kind work offers a data science enabled approach to identify market-level risks. Using a massive, self-constructed dataset of food safety tests, market-level adulteration risk scores are created through machine learning techniques. Analysis shows that provinces with more high-risk markets also have more human cases of zoonotic flu, and specific markets associated with zoonotic disease have higher risk scores. Furthermore, it is shown that high-risk markets have management deficiencies (e.g., illegal wild animal sales), potentially indicating that increased and integrated regulation targeting high-risk markets could mitigate these risks.

WSM/WM market-level food adulteration risk scores from public data and demonstrating their correlation with zoonotic disease in China. In particular, the paper provides empirical support to the hypothesis that these two types of risks (i.e., food adulteration and zoonotic disease risks) are correlated, potentially because they are both affected by common system-level environmental and management deficiencies in WSMs and WMs. Moreover, this suggests that assessment of food adulteration risks can meaningfully inform the assessment of zoonotic disease risks at the market level. This observed correlation stands in contrast to the fact that zoonotic disease and food safety risks are currently regulated by different Chinese government organizations.
To accomplish this, the paper leverages a self-constructed dataset of 4.0 million public (passing and failing) records of food adulteration tests conducted by China's state (central), provincial and prefecture-level Administrations for Market Regulation (AMRs), which are the primary organizations responsible for testing food products in WSMs/WMs 9,[21][22][23] . This dataset integrates all tests conducted between 2014 and 2020 and posted by the national AMR, all 31 provincial AMRs, and 273 of 333 prefecture AMRs, covering nearly all major cities and important agricultural areas. Data integrity was ensured through extensive pre-processing of the raw data, including deduplication of records to avoid double counting of food safety test records, assessing the reporting rates per location with respect to the population size, and more generally checking for other potential data anomalies, such as missing values (Full details about dataset creation can be found in the Supplementary Materials and our previous paper 9 ). Furthermore, this dataset is quite representative of true food adulteration risks throughout China, as the AMR agencies are legally required to publicly post all food test results on their respective websites 24 . Moreover, AMR's policy instructs a random sampling approach within each product category, which implies that the number of tests at each market is roughly proportional to the volume of products per respective category sold in the market 25 . Finally, all AMR food safety tests conform to the same technical testing methods and standards defined in China's national testing plan 15 .
This dataset includes 79,177 test records of animal products associated with WSMs/WMs, and an innovative unsupervised clustering methodology is developed to associate each test record with its respective market. This methodology yields a list of WSMs/WMs in China, as well as a food adulteration risk score, for each market, which is equal to its respective failure rate in the AMR food safety test dataset. The high-risk markets are defined as the markets with top highest 20% food adulteration risk score (equals to the failure rate of AMR testing records associated with this market) among all markets. The total volume of animal products sold through high-risk markets in each province provides province-level risk scores. While these risk scores are not intended to capture all nuances of food safety tests, such as adjustments for specific types of animals tested, they create significant decision support information for regulators. Notably, there previously existed no publicly available and comprehensive list of WSMs/WMs in China, let alone a method for curating all AMR tests at each market. This motivates the creation of an operational tool to communicate information to regulators, policy makers, and academics.
The association of food adulteration risk scores to zoonotic disease risks is validated through two analyses. First, a carefully designed linear regression indicates that province-level risk scores correlate positively with respective number of zoonotic influenza isolates (cases) in humans identified from the OpenFlu database 26 . This association holds when controlling for covariates, such as livestock production/slaughter, and distribution of markets in each province. An overview of the analysis is shown in Fig. 1. Second, specific markets associated with zoonotic outbreaks have higher risk scores than expected by chance. These analyses provide evidence that constructed risk scores give a single interpretable measure able to capture important aspects of both food adulteration and zoonotic disease risks at the market and province levels.

Materials and methods
Data. To assess food adulteration risk, the paper relies on a self-constructed dataset of food safety tests records conducted by China's AMR. The AMR is the primary regulatory organization responsible for testing food products at WSMs and WMs 9,21-23 . The state/province/prefecture AMRs are also required to publicly post all food test results on their respective websites by the State Council of China 24 . Furthermore, the AMR follows a random sampling approach, meaning that the number of tests at each market is roughly proportional to the volume of products sold at the market, and that when AMR inspectors visit markets, they randomly select products for food safety testing 25 . Additionally, all AMR food safety tests conform to the methods and standards defined in China's national testing plan, including which adulterants to test for and what legal limits define a failing test 15 . We have conducted several robustness checks and verified that there are actually slight decreases in the failure rate over time, which are statistically significant given the large amount of random sampling done by the AMR. Thus, these AMR food safety tests are comparable, and public postings offer an unbiased glimpse into food adulteration risks at various markets.
This AMR dataset integrates and structures 4.0 million food safety tests conducted between 2014 and 2020 and collected from postings by the state AMR, all 31 provincial AMRs, and 273 of 333 prefecture AMRs, covering nearly all major cities and important agricultural areas. This constitutes the largest and most representative set of Chinese food safety tests anywhere in the literature. See Additional Details A for more details, including data fields used in the analysis.
To map zoonotic disease outbreaks, the OpenFlu database was used to measure the number of unique zoonotic (i.e., avian and swine influenza) influenza isolates (cases) found in humans in China 26 [27][28][29] . Notes: *The OpenFluDB isolates correspond to a unique "sample of a zoonotic flu virus sequence isolated from a human patient". Additionally, the data used in this analysis was reviewed in order to ensure each isolate also corresponded to a unique patient/case (although not necessarily a unique strain of virus).
To understand potential risks for zoonotic disease outbreaks, at markets with high food adulteration risk scores, Baidu searches (see details in Additional Details D1) are used to collect news stories related to systemlevel deficiencies at markets, such as illegal wild animal sales, sanitation, and lack of quarantine inspection for slaughtered animals. These deficiencies are unlikely to have direct effect on the food adulteration risk scores (i.e., AMR test failures due to adulteration).

Market-level food adulteration risk scores.
To develop market-level food adulteration risk scores, 79,177 test records of animal products (e.g., poultry, pork) related to WSMs/WMs are extracted from the AMR dataset (see Additional Details A for details).
Associating each record with a specific market requires development of a new machine learning clustering algorithm. The challenge stems from several reasons. One is the lack of a publicly available, comprehensive list of markets, requiring unsupervised clustering of 'similar' records (i.e., from the same market). AMR records are also inconsistent in how they specify names/addresses of the respective businesses. A third challenge is that common string distance metrics (e.g., Levenshtein distance) used by unsupervised text clustering algorithms work poorly on Chinese text 30 . More specifically, traditional machine learning methods and existing packages designed for text in Chinese (e.g., the Python package Jieba) are already widely available. However, Jieba is more focused on tokenizing sentences into vocabularies (nouns, verbs, etc.) and doesn't work well for clustering different groups of nouns. This presents a challenge for the task of associating test records with specific market, since both the records and the market names often consist of a collection of nouns and do not constitute sentences. In particular, the accuracy obtained by existing clustering packages was well below 50%. Addressing this challenge required the development of technical approaches complementary to the existing machine learning methodologies and packages (Jieba, Jaccard similarity scores, etc.). Specifically, an entirely new machine learning tokenization approach was developed, specifically designed for this problem, allowing the application of traditional clustering algorithms.

IdenƟfy AMR Database Test Records of Animal Products at Markets
Step II

AMR Food AdulteraƟon Database IntegraƟon
Step I

Unsupervised Clustering to Connect Records to Specific Markets
Step IIII Step 0.

AMR Dataset Location Name
Step 3. Tokenization of 'Potential Proper Name' for Jaccard Similarity Score Wuhan Huanan Seafood Gaoshi Step 1.

Identify Constituent Parts (Geographic and Market Keywords)
Step 2.
Identify 'Potential Proper Name' (Between

Province Risk Scores (Number of AMR Tests at Markets in the Top 20% of Risk Scores)
Step V

Market Risk Scores (Failure Rates)
Step IV

Province Analysis (AssociaƟon of Province Risk Scores with ZoonoƟc Flu Isolates in Humans)
Step VI Province-Level Risk Scores Figure 1. Overview of the risk score analysis to evaluate association of province food adulteration risk scores with zoonotic flu isolates found in humans. AMR dataset tests of animals are clustered to specific markets in order to create market level risk scores. These risk scores are then aggregated at the province level, and the association against province-level zoonotic flu isolates in humans is assessed. Full details can be found in the Supplementary Materials. This graph was created by Microsoft PowerPoint. www.nature.com/scientificreports/ The new tokenization methodology relies on the specific structure of market names in Chinese text. Market names follow two conventions. The first convention is: "geographic keyword" + "proper name" + "market keyword" (e.g., 'Wuhan Huanan Seafood Market' , with proper name 'Huanan'). The second convention is the same without "proper name" (e.g., 'Wuhan Vegetable Wholesale Market').
Notes: ‡"geographic keyword" means geographic location, such as province/prefecture/county name, "proper name" stands for the unique part of the market name representing the specific market, "market keyword" is a keyword indicating the record relates to a market.
The clustering algorithm leverages this as follows. First, the 'geographic keyword' and 'market keyword' are identified through substring matching. Then, the substring between these, referred to as the 'potential proper name', is extracted. If there is no such substring, the assumption is that the record relates to a market without a proper name. Figure 2 visualizes this process.
Records are clustered within each prefecture, leveraging Jaccard similarity scores, with final clusters corresponding to unique markets (see Additional Details B for full details of the Jaccard similarity score calculations) 31 . Clustering accuracy is evaluated out-of-sample for all 10,270 market-related records from Zhejiang, Hunan, and Guangdong, marking an entire cluster and all associated records are marked as incorrect if there is a single identification mistake. Finally, risk scores for each market are calculated as the failure rates of the respective AMR tests. This is done for all markets with 20+ AMR tests, to ensure robustness of the failure rates. An additional check shows that risk scores are robust to potentially biased reporting in the AMR dataset (see Additional Details C for details).
Association between risk scores and zoonotic disease risks. The paper presents three analyses to establish the association between the food adulteration risk scores and zoonotic disease related risks.
First, a linear regression model is used to evaluate the association of the food adulteration risk scores with province-level risk of zoonotic disease outbreaks. Specifically, the dependent variable is defined as the provincelevel number of human isolates of zoonotic flu from the OpenFlu database (see "Materials and methods" section).
The key independent variable is the total province-level number of AMR tests at high-risk markets (top 20% of risk scores). Because the AMR follows a random sampling approach, this is a proxy for the volume of animals sold through high-risk markets 25 . Multiple control variables are included, most notably production/slaughter volumes by province, as well as the number of AMR tests of animals conducted at all markets in the province. Because OpenFlu isolates are related to avian/swine influenza, these variables should control against known transmission vectors. All model variables and descriptions are shown in Table 1.
Linear regression models are estimated in-sample using variables from Table 1. First, best-fit model is identified minimizing Akaike information criterion (AIC) across all combinations of variables. The benchmark model is identified likewise but excluding the key independent variable. AIC difference is compared, to heuristically evaluate model improvement 32 . See Additional Details E for details related to linear regression data assumptions.
Second, risk scores of all specific markets associated with zoonotic diseases outbreaks are considered. Under the null hypothesis (i.e., market risk scores are not associated with outbreaks), the market risk score percentiles for these markets distribute uniformly between 0 and 100%. The probability (one-sided test) that the two markets are in the top x% of risk scores, under the null hypothesis, is p = x 100 2 . Third, the paper evaluates association of market risk scores with existence of negative news related to systemlevel environmental and management deficiencies that are risk factors for zoonotic disease outbreaks (e.g., allowing sale of illegal wild animals-see Sect. 2a). The 40 highest-risk markets are each paired with a lowerrisk market, from the same geography and with approximately the same number of AMR tests (see details in

Results
Clustering results. The results of the clustering algorithm (assigning AMR records to specific markets) are shown in Table 2 below. Note that the 853 high-volume markets (i.e., 20+ AMR tests) cover 54% of all tests at markets and we chose those markets for our analysis. Because large markets consolidate 70-80% of the supply for most agricultural products, it follows from Table 2 that high-volume markets distribute a substantial amount of the animal supply. Based on manual review, the out-of-sample clustering accuracy for all 10,270 market-related records from Zhejiang, Hunan, and Guangdong provinces is between 89.02 and 93.44% at the record-level, and 91.7% to 94.9% at the cluster-level (see Supplementary Materials for additional details).
After clustering, markets are assigned food adulteration risk scores equal to their respective failure rates in the AMR database. These risk scores were used to create an operational tool for regulators, communicating the results of this market risk score analysis. See Fig. 3 below for an overview of the market risk score tool.  Table 2. Clustering results: number of markets identified and distribution of tests per market. The first column represents different inclusion thresholds for markets to be included in the analysis as per the number of tests associated with the respective market. The 2nd to 5th columns provide statistics on the markets included in the group: number of markets in the group, the fraction these markets out of all markets, the total number of tests associated with markets in the group, the fraction these tests out of all tests and the average risk score of the markets in the group. For example, if we choose markets that have at least 100 tests (first row), then only 0.6% of all markets and 17% of all tests will be covered in our analysis. Selected markets are in bold. www.nature.com/scientificreports/ Association between provincial food adulteration risk scores and zoonotic disease risk. The correlation between province-level risk scores and respective zoonotic flu isolates (cases) in humans is shown in Fig. 4. A linear regression model is used to evaluate association of province-level risk scores with zoonotic influenza cases in humans, while controlling for covariates, such as livestock production, slaughter, and the distribution of markets in each province. All model variables and descriptions and the best-fit linear regression model are shown in Tables 1 and 3, respectively. The best-fit model achieves an AIC of 268.59, and the benchmark model (only using control variables) an AIC of 275.33. This AIC improvement exceeds the common heuristic of 2.0 used to show model improvement 32 . The in-sample coefficient of determination (R 2 ) improves from 0.421 to 0.534 with the addition of the key independent variable (number of AMR tests at high-risk markets). Furthermore, as observed in Table 3, the model coefficient for AMR tests at high-risk markets is significant (p < 1 * 10 -5 ).

Percent of markets covered (%)
Risk scores at specific markets implicated in spreading zoonotic disease. Zoonotic disease outbreaks have been tied to specific markets twice per literature review. Specifically, the Wuhan Huanan Seafood Market, associated with the first COVID-19 cluster, is among the top 6% of risky markets 11 . Likewise, the Beijing    Published media news related to high-risk markets. Finally, the association of market risk scores with news related to system-level environmental and management deficiencies was analyzed, through a paired analysis of the 40 highest risk markets, and 40 lower risk markets normalized on geography and market size. Negative news was found for 27 out of 40 high-risk markets and 13 of 40 lower-risk markets, indicating statistical significance (two sample binomial test, p = 0.0017). Notably, two of the high-risk markets had news stories related to illegal sale of wild animals, which is especially risky for zoonotic disease.

Discussion and conclusion
While many researchers have noted the link between Chinese WSMs/WMs and zoonotic disease outbreaks, this has not led to many actionable recommendations [1][2][3][4][5][6][7]13 . In fact, the only such recommendation identified in the literature was calls for closure of all WSMs and WMs 17,18 . However, this approach lacks practicality and would have serious secondary consequences because of the centrality of these markets in China's food system. As noted in the introduction, these markets are major distribution points in the Chinese agricultural supply chain, consolidating 70-80% of supply for many agricultural products 9,10,19 . Closing all markets would adversely affect the livelihood of millions of market vendors, farmers, and transporters 10 .
Addressing an important need, this research provides an important foundation to a new and more targeted risk management approach to address zoonotic disease and food adulteration risks posed by WSMs and WMs 10 . Specifically, the paper develops individual market-level food adulteration risk scores based on publicly posted AMR data. Empirical analyses show that these risk scores correlate with zoonotic disease risks. First, it is shown that province-level risk scores are significantly associated with the respective number of zoonotic flu isolates identified in the province. It is also shown that market-level risk scores are significantly higher at markets associated with zoonotic disease outbreaks. These analyses support the hypothesis that food adulteration and zoonotic disease related risks are driven, at least partially, by common system-level environmental and management deficiencies in WSMs and WMs. A media news analysis provides additional evidence that markets with high risk-scores are significantly more likely to have negative news related to sanitation or management deficiencies that directly drive zoonotic disease risk (e.g., allowing the illegal sales of wild animals), further supporting the hypothesis.
The methodology developed in this paper is also aligned with China's current policy approach to food safety management and regulation. In recent years, the Chinese government has invested substantial efforts into improving food safety, notably by publishing food inspection results, and increasing transparency for both public complaints about food safety and media reporting, all with the goal of increasing access to relevant information 24,34 . The analyses in this paper illustrate how this data could enable market-level risk assessment. China has already acknowledged the strategic importance of WSMs and WMs and has emphasized the need for increased monitoring and inspections of markets 15 . Therefore, the risk-based methodology in this paper may support this effort by allowing the Chinese government to pursue more targeted policies that focus on high-risk markets and lower both food adulteration risk and zoonotic disease risk. The ability to better identify risk at the local (market) level is particularly critical given the size of the food supply chain in China and the decentralized regulatory environment. Another important insight that emerges from the analyses in the paper is that food safety and zoonotic disease risks should regulated in integrated manner.
The work in this paper is not only a first-of-its-kind approach to use advanced data analytics for the identification of specific risky supply chain nodes (markets), but it also provides empirical evidence for the hypothesis that food adulteration and zoonotic diseases risks share common drivers, which can directly inform public health policies. While more research is needed to better understand this association and identify the specific risk drivers, this approach has the potential to offer entirely new ways of understanding why zoonotic disease outbreaks occur, as well as complementary regulatory approaches to reducing pandemic risks.

Data availability
All data, code, and materials used in the analysis are available upon request by the corresponding author. The datasets generated during and analyzed during the current study are available from the corresponding author on reasonable request.